Take Home Exercise 3: Predicting HDB Public Housing Resale Pricies using Geographically Weighted Methods

Author

Harith Oh Yee Choon

Published

March 25, 2023

Modified

March 26, 2023

Background Information

Housing is an essential component of household wealth worldwide. Be it for a couple or individual, buying a house has always been a dream or goal to stay for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.

The Objective

In this take-home exercise, I am tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. I am also required to compare the performance of the conventional OLS method versus the geographical weighted methods.

1. Preparing Datasets

  • Aspatial dataset:

    • HDB Resale data: a list of HDB resale transacted prices in Singapore from Jan 2017 onwards. It is in csv format which can be downloaded from Data.gov.sg.
  • Geospatial dataset:

    • MP14_SUBZONE_WEB_PL: a polygon feature data providing information of URA 2014 Master Plan Planning Subzone boundary data. It is in ESRI shapefile format. This data set was also downloaded from Data.gov.sg
  • Location factors with regards to the geographic coordinates:

    • Downloaded from Data.gov.sg.

      • Eldercare data is a list of eldercare in Singapore. It is in shapefile format.

      • Hawker Centre data is a list of hawker centres in Singapore. It is in geojson format.

      • Parks data is a list of parks in Singapore. It is in geojson format.

      • Supermarket data is a list of supermarkets in Singapore. It is in geojson format.

      • CHAS clinics data is a list of CHAS clinics in Singapore. It is in geojson format.

      • Childcare service data is a list of childcare services in Singapore. It is in geojson format.

      • Kindergartens data is a list of kindergartens in Singapore. It is in geojson format.

    • Downloaded from Datamall.lta.gov.sg.

      • MRT data is a list of MRT/LRT stations in Singapore with the station names and codes. It is in shapefile format.

      • Bus stops data is a list of bus stops in Singapore. It is in shapefile format.

  • Location factors without geographic coordinates:

    • Downloaded from Data.gov.sg.

      • Primary school data is extracted from the list on General information of schools from data.gov portal. It is in csv format.
    • Retrieved/Scraped from other sources

      • CBD coordinates obtained from Google.

      • Shopping malls data is a list of Shopping malls in Singapore obtained from Wikipedia.

      • Good primary schools is a list of primary schools that are ordered in ranking in terms of popularity and this can be found at Local Salary Forum.

2. Loading the R packages

The following code chunk will perform the following task. A list called packages will be created and will consists of all the R packages required to accomplish this exercise. There will be a check to see if R packages on package have been installed in R and if not, they will be installed. After which when all the R packages have been installed, the packages will then be loaded

packages <- c('sf', 'tidyverse', 'tmap', 'httr', 'jsonlite', 'rvest', 
              'sp', 'ggpubr', 'corrplot', 'broom',  'olsrr', 'spdep', 
              'GWmodel', 'devtools', 'rgeos', 'lwgeom', 'maptools', 'rsample', 'Metrics', 'SpatialML')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p, repos = "http://cran.us.r-project.org")
  }
  library(p, character.only = T)
}

Also, you are required to download xaringanExtra from a github repository. To avoid bad credentials when downloading from the Github repository, do retrieve your personal token to fill under pat.Once loaded, the code will prompt you options to download, type ‘1’ in the console section to download all.

#Load the 'remotes' package
library(remotes)

#Set the GitHub repository and personal access token
repo <- "gadenbuie/xaringanExtra"
pat <- "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"

#Install the package from GitHub using the personal access token
devtools::install_github(repo, auth_token = pat, force= TRUE)
xfun      (0.37  -> 0.38 ) [CRAN]
htmltools (0.5.4 -> 0.5.5) [CRAN]

  There are binary versions available but the source versions are later:
          binary source needs_compilation
xfun        0.37   0.38              TRUE
htmltools  0.5.4  0.5.5              TRUE

── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file 'C:\Users\Harith Oh\AppData\Local\Temp\RtmpWi3bBD\remotes5e04196c4e2f\gadenbuie-xaringanExtra-f394e92/DESCRIPTION' ... OK
* preparing 'xaringanExtra':
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'xaringanExtra_0.7.0.9000.tar.gz'

More on the packages used:

sf: used for importing, managing, and processing geospatial data

tidyverse: used for importing, wrangling and visualising data. It consists of a family of R packages, such as:

readr for importing csv data,

readxl for importing Excel worksheet,

tidyr for manipulating data,

dplyr for transforming data, and

ggplot2 for visualising data

tmap: provides functions for plotting cartographic quality static inpoint patterns maps or interactive maps by using leaflet API.

httr: Useful tools for working with HTTP organised by HTTP verbs (GET(), POST(), etc). Configuration functions make it easy to control additional request components (authenticate(), add_headers() and so on).

In this analysis, it will be used to send GET requests to OneMapAPI SG to retrieve the coordinates of addresses.

jsonlite: A simple and robust JSON parser and generator for R. It offers simple, flexible tools for working with JSON in R, and is particularly powerful for building pipelines and interacting with a web API.

rvest: A new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup.

In this analysis, it will be used to scrape data for shopping malls and good primary schools

sp: provides classes and methods for dealing with spatial data in R.

ggpubr: provides some easy-to-use functions for creating and customizing ggplot2 based publication ready plots

In this analysis, it will be used to arrange multiple ggplots.

corrplot: For Multivariate data visualisation and analysis

broom: Takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibble.

In this analysis, functions like tidy and glance will be used to construct a tibble / summmary of the model which is easier to look at.

oslrr: Used to build OLD and performing diagnostic tests.

spdep: For spatial dependence statistics.

GWmodel: Calibrate geographical weighted family of modes.

devtools: used for installing any R packages which is not available in RCRAN. In this exercise, I will be installing using devtools to install the package xaringanExtra which is still under development stage.

xaringanExtra: is an enhancement of xaringan package. As it is still under development stage, we can still install the current version using install_github function of devtools. This package will be used to add Panelsets to contain both the r code chunk and results whereever applicable.

#3. Importing Aspatial Data & Wrangling

read_csv() function of readr package will be used to import resale-flat-prices into R as a tibble data frame called resale. glimpse() function of dplyr package is used to display the data structure

resale <- read_csv("data/aspatial/resale-flat-prices.csv")
glimpse(resale)
Rows: 148,373
Columns: 11
$ month               <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block               <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name         <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range        <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm      <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model          <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease     <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price        <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…

Once loaded, you can see that the dataset contains 11 columns with 148,373 rows. This include columns which are: month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price.

For this take home exercise 3, we are allowed the option to choose to perform our analysis between either 3, 4 or 5 room flat transactions. Therefore, I will be selecting the 3 room flat transactions during the transaction period from 1st January 2021 to 31st December 2022. Test data should be included for January and February 2023 resale prices.

##3.1 Filtering HDB Resale Data

filter() function of dplyr package will be used to select the desired flat_type and dates which will be stored in rs_subset.

rs_subset <-  filter(resale,flat_type == "3 ROOM") %>% 
              filter(month >= "2021-01" & month <= "2023-02")

To check if the have been extracted flat_type and month have been extracted successfully, unique() function of R package will be used.

unique(rs_subset$month)
 [1] "2021-01" "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07"
 [8] "2021-08" "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"
[15] "2022-03" "2022-04" "2022-05" "2022-06" "2022-07" "2022-08" "2022-09"
[22] "2022-10" "2022-11" "2022-12" "2023-01" "2023-02"
unique(rs_subset$flat_type)
[1] "3 ROOM"

glimpse() function will be used after to take a look at the overall resale transactions available for 3 room flat in Singapore.

glimpse(rs_subset)
Rows: 13,780
Columns: 11
$ month               <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block               <chr> "331", "534", "561", "170", "463", "542", "170", "…
$ street_name         <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO K…
$ storey_range        <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm      <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 73…
$ flat_model          <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1980, 1980, 1986, 1980, 1981, 1986, 1976, 19…
$ remaining_lease     <chr> "59 years", "58 years 02 months", "58 years 01 mon…
$ resale_price        <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 27…

As shown above, we can see that from Jan 2021 to December 2022, there are 23,656 transactions for 3 room flat in Singapore.

##3.2 Transforming HDB Resale Data Columns

Here, mutate function of dplyr package will be used to create columns such as:

address: concatenation of the block and street_name columns using paste() function of base R package. remaining_lease_yr & remaining_lease_mth: split the year and months part of the remaining_lease respectively using str_sub() function of stringr package then converting the character to integer using as.integer() function of base R package. After performing mutate function, we will store the new data in rs_transform

rs_transform <- rs_subset %>%
  mutate(rs_subset, address = paste(block,street_name)) %>%
  mutate(rs_subset, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2)))%>%
  mutate(rs_subset, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))
head(rs_transform)
# A tibble: 6 × 14
  month town  flat_type block street_name storey_range floor_area_sqm flat_model
  <chr> <chr> <chr>     <chr> <chr>       <chr>                 <dbl> <chr>     
1 2021… ANG … 3 ROOM    331   ANG MO KIO… 04 TO 06                 68 New Gener…
2 2021… ANG … 3 ROOM    534   ANG MO KIO… 04 TO 06                 68 New Gener…
3 2021… ANG … 3 ROOM    561   ANG MO KIO… 01 TO 03                 68 New Gener…
4 2021… ANG … 3 ROOM    170   ANG MO KIO… 07 TO 09                 60 Improved  
5 2021… ANG … 3 ROOM    463   ANG MO KIO… 04 TO 06                 68 New Gener…
6 2021… ANG … 3 ROOM    542   ANG MO KIO… 04 TO 06                 68 New Gener…
# ℹ 6 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
#   resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
#   remaining_lease_mth <int>

As observed, There are some empty values in remaining lease months with value of 0. We need to multiply the remaining_lease_yr by 12 to convert into months.

By using rowSums() of R package.The remaining_lease_mths column will be created using mutate function of dplyr package which contains the summation of the remaining_lease_yr and remaining_lease_mths.

rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform <- rs_transform %>% 
  mutate(rs_transform, remaining_lease_mths = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mth")])) %>%
  select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model, 
         lease_commence_date, remaining_lease_mths, resale_price)
head(rs_transform)
# A tibble: 6 × 12
  month   town   address block street_name flat_type storey_range floor_area_sqm
  <chr>   <chr>  <chr>   <chr> <chr>       <chr>     <chr>                 <dbl>
1 2021-01 ANG M… 331 AN… 331   ANG MO KIO… 3 ROOM    04 TO 06                 68
2 2021-01 ANG M… 534 AN… 534   ANG MO KIO… 3 ROOM    04 TO 06                 68
3 2021-01 ANG M… 561 AN… 561   ANG MO KIO… 3 ROOM    01 TO 03                 68
4 2021-01 ANG M… 170 AN… 170   ANG MO KIO… 3 ROOM    07 TO 09                 60
5 2021-01 ANG M… 463 AN… 463   ANG MO KIO… 3 ROOM    04 TO 06                 68
6 2021-01 ANG M… 542 AN… 542   ANG MO KIO… 3 ROOM    04 TO 06                 68
# ℹ 4 more variables: flat_model <chr>, lease_commence_date <dbl>,
#   remaining_lease_mths <dbl>, resale_price <dbl>

##3.3 Retrieve addresses & its coordinates

Postal codes and coordinates of the addresses will be used to get the proximity to various location factors later on.

###3.3.1 Producing unique list of addresses

Unique addresses will be stored in add_list by using unique() function of base R package and sort() function of R package to extract the unique addresses and sort the unique vector respectively.

add_list <- sort(unique(rs_transform$address))

###3.3.2 Coordinates from OneMapSG API

A dataframe postal_coords will be created to store all final retrieved coordinates. To performa GET request, to https://developers.onemap.sg/commonapi/search, the GET() function of httr package will be used. There are a few search arguments variables and information we have to take note of

  • searchVal: Unique keywords that user will enter to filter results

  • returnGeom {Y/N}: Yes or No to check if user want to return the geometry

  • getAddrDetails {Y/N}: Yes or No to check if user want to return address details for a point

Return JSON response will contain many fields but we are only interested in postal code and coordinates like Longitude and Latitude. A new dataframe new_row will be created and is used to store each final set of coordinates retrieved. There is also the need to check the number of responses because some searched location have 0 as some only have 1 result and others have many. Finally, the JSON result will be appended to the dataframe postal_coords using rbind() function of R.

get_coords <- function(add_list){

  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://developers.onemap.sg/commonapi/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

###3.3.3 Retrieve resale coordinates

coords <- get_coords(add_list)

###3.3.4 Check for NA results

Check if columns contain any Nil or NA values with is.na() function of R

coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)

As observed, data such as 215 CHOA CHU KANG CTRL has nil postal code despite having coordinates available. When performing a search on gothere.sg, the postal code should be 680215.

###3.3.5 Combining HDB resale and coordinate data

After retrieving the coordinates, we need to combine with the HDB resale dataset using left_join() function of dplyr package. The data will be stored in rs_coords.

rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))
head(rs_coords)
# A tibble: 6 × 15
  month   town   address block street_name flat_type storey_range floor_area_sqm
  <chr>   <chr>  <chr>   <chr> <chr>       <chr>     <chr>                 <dbl>
1 2021-01 ANG M… 331 AN… 331   ANG MO KIO… 3 ROOM    04 TO 06                 68
2 2021-01 ANG M… 534 AN… 534   ANG MO KIO… 3 ROOM    04 TO 06                 68
3 2021-01 ANG M… 561 AN… 561   ANG MO KIO… 3 ROOM    01 TO 03                 68
4 2021-01 ANG M… 170 AN… 170   ANG MO KIO… 3 ROOM    07 TO 09                 60
5 2021-01 ANG M… 463 AN… 463   ANG MO KIO… 3 ROOM    04 TO 06                 68
6 2021-01 ANG M… 542 AN… 542   ANG MO KIO… 3 ROOM    04 TO 06                 68
# ℹ 7 more variables: flat_model <chr>, lease_commence_date <dbl>,
#   remaining_lease_mths <dbl>, resale_price <dbl>, postal <chr>,
#   latitude <chr>, longitude <chr>

##3.4 Write and read the combined file to rds

rs_coords_rds <- write_rds(rs_coords, "data/rds/rs_coords.rds")
rs_coords <- read_rds("data/rds/rs_coords.rds")
glimpse(rs_coords)
Rows: 13,780
Columns: 15
$ month                <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
$ town                 <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ address              <chr> "331 ANG MO KIO AVE 1", "534 ANG MO KIO AVE 10", …
$ block                <chr> "331", "534", "561", "170", "463", "542", "170", …
$ street_name          <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO …
$ flat_type            <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM",…
$ storey_range         <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "…
$ floor_area_sqm       <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 7…
$ flat_model           <chr> "New Generation", "New Generation", "New Generati…
$ lease_commence_date  <dbl> 1981, 1980, 1980, 1986, 1980, 1981, 1986, 1976, 1…
$ remaining_lease_mths <dbl> 708, 698, 697, 770, 698, 709, 768, 652, 682, 664,…
$ resale_price         <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 2…
$ postal               <chr> "560331", "560534", "560561", "560170", "560463",…
$ latitude             <chr> "1.36211140145298", "1.37405846295585", "1.370577…
$ longitude            <chr> "103.85076647513", "103.854168170426", "103.85785…

###3.4.1 Transform and Assign CRS

We will need to assign the CRS of 4326 first before transforming it to 3414 which is the EPSG code for Singapore SVY21 since the projected CRS will be WGS84

rs_coords_sf <- st_as_sf(rs_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)
st_crs(rs_coords_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

###3.4.2 Plotting HDB resale points

tmap_mode("view")
tm_shape(rs_coords_sf)+
  tm_dots(col="blue", size = 0.02)
tmap_mode("plot")

#4. Importing Geospatial Locational Factors

##4.1 Location factors with coordinates

###4.1.1 Reading of location factors

elder_sf <- st_read(dsn = "data/geospatial", layer="ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
mrtlrt_sf <- st_read(dsn = "data/geospatial", layer="MRTLRTStnPtt")
Reading layer `MRTLRTStnPtt' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 185 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6138.311 ymin: 27555.06 xmax: 45254.86 ymax: 47854.2
Projected CRS: SVY21
bus_sf <- st_read(dsn = "data/geospatial", layer="BusStop")
Reading layer `BusStop' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
hawker_sf <- st_read("data/geospatial/hawker-centres-geojson.geojson") 
Reading layer `hawker-centres-geojson' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial\hawker-centres-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449217
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
supermkt_sf <- st_read("data/geospatial/supermarkets-geojson.geojson") 
Reading layer `supermarkets-geojson' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial\supermarkets-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
chas_sf <- st_read("data/geospatial/moh-chas-clinics.geojson")
Reading layer `moh-chas-clinics' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial\moh-chas-clinics.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1064 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.5818 ymin: 1.016264 xmax: 103.9903 ymax: 1.456037
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
childcare_sf <- st_read("data/geospatial/childcare.geojson") 
Reading layer `childcare' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial\childcare.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1545 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6824 ymin: 1.248403 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
kind_sf <- st_read("data/geospatial/preschools-location.geojson") 
Reading layer `preschools-location' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial\preschools-location.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1925 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6824 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
parks_sf <- st_read("data/geospatial/parks-geojson.geojson")
Reading layer `parks-geojson' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial\parks-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 350 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214058 xmax: 104.0017 ymax: 1.461503
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
st_crs(elder_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(mrtlrt_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(bus_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
st_crs(hawker_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(supermkt_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(chas_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(childcare_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(kind_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
st_crs(parks_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

###4.1.2 Assign EPSG code to sf dataframes

elder_sf <- st_set_crs(elder_sf, 3414)
mrtlrt_sf <- st_set_crs(mrtlrt_sf, 3414)
bus_sf <- st_set_crs(bus_sf, 3414)
hawker_sf <- hawker_sf %>%
  st_transform(crs = 3414)
supermkt_sf <- supermkt_sf %>%
  st_transform(crs = 3414)
chas_sf <- chas_sf %>%
  st_transform(crs = 3414)
childcare_sf <- childcare_sf %>%
  st_transform(crs = 3414)
kind_sf <- kind_sf %>%
  st_transform(crs = 3414)
parks_sf <- parks_sf %>%
  st_transform(crs = 3414)
st_crs(elder_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(mrtlrt_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(bus_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(hawker_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(supermkt_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(chas_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(childcare_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(kind_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
#st_crs(parks_sf)
length(which(st_is_valid(elder_sf) == FALSE))
[1] 0
length(which(st_is_valid(mrtlrt_sf) == FALSE))
[1] 0
length(which(st_is_valid(hawker_sf) == FALSE))
[1] 0
length(which(st_is_valid(parks_sf) == FALSE))
[1] 0
length(which(st_is_valid(supermkt_sf) == FALSE))
[1] 0
length(which(st_is_valid(chas_sf) == FALSE))
[1] 0
length(which(st_is_valid(childcare_sf) == FALSE))
[1] 0
length(which(st_is_valid(kind_sf) == FALSE))
[1] 0
length(which(st_is_valid(bus_sf) == FALSE))
[1] 0

###4.1.3 Proximity Calculation of location

get_prox function will be created to perform calculation of distances between the HDB resale and location factors

get_prox <- function(origin_df, dest_df, col_name){
  
  # creates a matrix of distances
  dist_matrix <- st_distance(origin_df, dest_df)           
  
  # find the nearest location_factor and create new data frame
  near <- origin_df %>% 
    mutate(PROX = apply(dist_matrix, 1, function(x) min(x)) / 1000) 
  
  # rename column name according to input parameter
  names(near)[names(near) == 'PROX'] <- col_name

  return(near)
}

get_prox function will be called to get the proximity of the resale HDB and location factors such as Eldercare, MRT, Hawker, Parks, Supermarkets & CHAS clinics

rs_coords_sf <- get_prox(rs_coords_sf, elder_sf, "PROX_ELDERLYCARE") 
rs_coords_sf <- get_prox(rs_coords_sf, mrtlrt_sf, "PROX_MRT") 
rs_coords_sf <- get_prox(rs_coords_sf, hawker_sf, "PROX_HAWKER") 
rs_coords_sf <- get_prox(rs_coords_sf, parks_sf, "PROX_PARK") 
rs_coords_sf <- get_prox(rs_coords_sf, supermkt_sf, "PROX_SUPERMARKET")
rs_coords_sf <- get_prox(rs_coords_sf, chas_sf, "PROX_CHAS")

###4.1.4 Calculating location factors within distance

get_within function will create a matrix of distances between the HDB and the location factor using st_distance of sf package.

Also, it will provide the sum of points of the location factor that are within the threshold distance using sum function of base R package which will be added to HDB resale data under a new column using mutate() function of dpylr package.

Lastly, it will rename the column name according to input given by user so that the columns have appropriate and distinct names that are different from one another.

get_within <- function(origin_df, dest_df, threshold_dist, col_name){
  
  # creates a matrix of distances
  dist_matrix <- st_distance(origin_df, dest_df)   
  
  # count the number of location_factors within threshold_dist and create new data frame
  wdist <- origin_df %>% 
    mutate(WITHIN_DT = apply(dist_matrix, 1, function(x) sum(x <= threshold_dist)))
  
  # rename column name according to input parameter
  names(wdist)[names(wdist) == 'WITHIN_DT'] <- col_name

  # Return df
  return(wdist)
}

The threshold will be set within 350m to detectand call all location factors such as, kindergartens, childcare and bus stops.

rs_coords_sf <- get_within(rs_coords_sf, kind_sf, 350, "WITHIN_350M_KINDERGARTEN")
rs_coords_sf <- get_within(rs_coords_sf, childcare_sf, 350, "WITHIN_350M_CHILDCARE")
rs_coords_sf <- get_within(rs_coords_sf, bus_sf, 350, "WITHIN_350M_BUS")

##4.2 Location factors without coordinates

###4.2.1 CBD Town Area

This is an important measure as Central Business District is an area where majority of the citizens travel to daily for work. By refering to google, the latitude and longitude of Downtown Core are at 1.287953 and 103.851784 respectively.

# Store CBD Coordinates in Dataframe
name <- c('CBD Area')
latitude= c(1.287953)
longitude= c(103.851784)
cbd_coords <- data.frame(name, latitude, longitude)

Convert data frame into sf object and perform transformation of crs

cbd_coords_sf <- st_as_sf(cbd_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

st_crs(cbd_coords_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

get_prox function will be called to get the proximity between the HDB resale flats and CBD area

rs_coords_sf <- get_prox(rs_coords_sf, cbd_coords_sf, "PROX_CBD")

###4.2.2 Shopping Malls

We can refer to the wikipedia page for the XPaths of the various lists.

url <- "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
malls_list <- list()

for (i in 1:7){
  malls <- read_html(url) %>%
    html_nodes(xpath = paste('//*[@id="mw-content-text"]/div[1]/div[',as.character(i),']/ul/li',sep="") ) %>%
    html_text()
  malls_list <- append(malls_list, malls)
}

#malls_list

From the malls_list result, there are 164 shopping malls. As these malls does not have their coordinates, we need to use the get_coords function created previously to search the names of these shopping malls.

malls_list_coords <- get_coords(malls_list) %>% 
  rename("mall_name" = "address")
malls_list_coords <- subset(malls_list_coords, mall_name!= "Yew Tee Shopping Centre")

We need to update the mall names that are no longer relevant and also malls with an index beside it due to Wikipedia page numbering as shown above output

invalid_malls<- subset(malls_list_coords, is.na(malls_list_coords$postal))
invalid_malls_list <- unique(invalid_malls$mall_name)
corrected_malls <- c("Clarke Quay", "City Gate", "Raffles Holland V", "Knightsbridge", "Mustafa Centre", "GR.ID", "Shaw House",
                     "The Poiz Centre", "Velocity @ Novena Square", "Singapore Post Centre", "PLQ Mall", "KINEX", "The Grandstand")

for (i in 1:length(invalid_malls_list)) {
  malls_list_coords <- malls_list_coords %>% 
    mutate(mall_name = ifelse(as.character(mall_name) == invalid_malls_list[i], corrected_malls[i], as.character(mall_name)))
}
malls_list <- sort(unique(malls_list_coords$mall_name))
malls_coords <- get_coords(malls_list)

To check if the malls still have any NA values

malls_coords[(is.na(malls_coords$postal) | is.na(malls_coords$latitude) | is.na(malls_coords$longitude)), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)

To convert the mall dataframe into SF object and perform transformation of CRS

malls_sf <- st_as_sf(malls_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

get_prox function will be called to get the proximity between the HDB resale flats and Shopping Malls

rs_coords_sf <- get_prox(rs_coords_sf, malls_sf, "PROX_MALL") 

###4.2.3 Primary Schools

pri_sch <- read_csv("data/aspatial/general-information-of-schools.csv")

pri_sch <- pri_sch %>%
  filter(mainlevel_code == "PRIMARY") %>%
  select(school_name, address, postal_code, mainlevel_code)

glimpse(pri_sch)
Rows: 183
Columns: 4
$ school_name    <chr> "ADMIRALTY PRIMARY SCHOOL", "AHMAD IBRAHIM PRIMARY SCHO…
$ address        <chr> "11   WOODLANDS CIRCLE", "10   YISHUN STREET 11", "100 …
$ postal_code    <chr> "738907", "768643", "579646", "159016", "544969", "5697…
$ mainlevel_code <chr> "PRIMARY", "PRIMARY", "PRIMARY", "PRIMARY", "PRIMARY", …

As shown above,there are 183 Primary Schools in Singapore.

For this step, you are required to create a list to store postal code of the schools and retrieve the coordinates

prisch_list <- sort(unique(pri_sch$postal_code))
prisch_coords <- get_coords(prisch_list)

To check for NA values

prisch_coords[(is.na(prisch_coords$postal) | is.na(prisch_coords$latitude) | is.na(prisch_coords$longitude)), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)

To combine coordinates with the schools

prisch_coords = prisch_coords[c("postal","latitude", "longitude")]
pri_sch <- left_join(pri_sch, prisch_coords, by = c('postal_code' = 'postal'))

To convert primary schools dataframe into SF object and perform transformation of CRS

prisch_sf <- st_as_sf(pri_sch,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

To call get_within function to obtain schools within a proximity of 1km

rs_coords_sf <- get_within(rs_coords_sf, prisch_sf, 1000, "WITHIN_1KM_PRISCH")

###4.2.4 Proximity to Good Primary Schools

To identify good primary schools in Singapore, one particular forum that we can use is www.salary.sg where it provides ranking of schools according to their popularity. Similarly, we can reuse the code that we extract the shopping malls in Singapore to extract the ranking of primary schools.

url <- "https://www.salary.sg/2021/best-primary-schools-2021-by-popularity/"

good_pri <- data.frame()

schools <- read_html(url) %>%
  html_nodes(xpath = paste('//*[@id="post-3068"]/div[3]/div/div/ol/li') ) %>%
  html_text() 

for (i in (schools)){
  sch_name <- toupper(gsub(" – .*","",i))
  sch_name <- gsub("\\(PRIMARY SECTION)","",sch_name)
  sch_name <- trimws(sch_name)
  new_row <- data.frame(pri_sch_name=sch_name)
  # Add the row
  good_pri <- rbind(good_pri, new_row)
}

top_good_pri <- head(good_pri, 10)

head(top_good_pri)
                         pri_sch_name
1     CHIJ ST. NICHOLAS GIRLS’ SCHOOL
2                      AI TONG SCHOOL
3                CATHOLIC HIGH SCHOOL
4                       ROSYTH SCHOOL
5 PEI HWA PRESBYTERIAN PRIMARY SCHOOL
6              NANYANG PRIMARY SCHOOL

To check if the extracted top primary schools’ name matches the existing names in the primary school dataframe

top_good_pri$pri_sch_name[!top_good_pri$pri_sch_name %in% prisch_sf$school_name]
[1] "CHIJ ST. NICHOLAS GIRLS’ SCHOOL" "CATHOLIC HIGH SCHOOL"           
[3] "ST. HILDA’S PRIMARY SCHOOL"     
#Unique list to store good school names
good_pri_list <- unique(top_good_pri$pri_sch_name)

To obtain the coordinates of good primary schools

goodprisch_coords <- get_coords(good_pri_list)

To check for NA values

goodprisch_coords[(is.na(goodprisch_coords$postal) | is.na(goodprisch_coords$latitude) | is.na(goodprisch_coords$longitude)), ]
                      address postal latitude longitude
10 ST. HILDA’S PRIMARY SCHOOL   <NA>     <NA>      <NA>

Renaming the names to match with the primary school dataframe names

top_good_pri$pri_sch_name[top_good_pri$pri_sch_name == "CHIJ ST. NICHOLAS GIRLS’ SCHOOL"] <- "CHIJ SAINT NICHOLAS GIRLS' SCHOOL"
top_good_pri$pri_sch_name[top_good_pri$pri_sch_name == "ST. HILDA’S PRIMARY SCHOOL"] <- "SAINT HILDA'S PRIMARY SCHOOL"

#Unique list to store good school names
good_pri_list <- unique(top_good_pri$pri_sch_name)

#To obtain the coordinates of good primary schools
goodprisch_coords <- get_coords(good_pri_list)

Final inspection to check for NA values

goodprisch_coords[(is.na(goodprisch_coords$postal) | is.na(goodprisch_coords$latitude) | is.na(goodprisch_coords$longitude)), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)

To convert the good primary schools dataframe into SF object and perform transformation of CRS

goodpri_sf <- st_as_sf(goodprisch_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

st_crs(goodpri_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

get_prox function will be called to get the proximity between the HDB resale flats and good primary schools

rs_coords_sf <- get_prox(rs_coords_sf, goodpri_sf, "PROX_GOOD_PRISCH")

##4.3 Filtering dataset

rs_coords_sf_training <-  filter(rs_coords_sf ,flat_type == "3 ROOM") %>% 
              filter(month >= "2021-01" & month <= "2023-02")

###4.3.1 Writing to RDS file

rs_factors_rds_training <- write_rds(rs_coords_sf_training, "data/rds/rs_factors_training.rds")

#5. Import Data for Analysis

##5.1 Geospatial Data

Here, st_read of sf package to read simple features or layers from file.

mpsz_sf <- st_read(dsn = "data/geospatial", layer="MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `C:\Harith-oh\IS415-Harith\Take-home_Ex3\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
mpsz_sf <- st_transform(mpsz_sf, 3414)

R object used to contain the imported MPSZ-2019 shapefile is called mpsz_sf and it is a simple feature object. The correct EPSG code for SVY21 should be 3414 therefore a transformation is done above.

###5.1.1 Check for invalid geometry and handle it

length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 6

Turning the invalid geometry valid by using st_make_valid function

mpsz_sf <- st_make_valid(mpsz_sf)
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0

Now the geometry is valid.

##5.2 Aspatial Data for HDB resale location factors

rs_factors_training <- read_rds("data/rds/rs_factors_training.rds")

##5.3 Extract storey_order due to character type

storeys <- sort(unique(rs_factors_training$storey_range))
# Create dataframe storey_range_order to store order of storey_range
storey_order <- 1:length(storeys)
storey_range_order <- data.frame(storeys, storey_order)

# Combine storey_order with resale dataframe for training and testing data
rs_factors_training <- left_join(rs_factors_training, storey_range_order, by= c("storey_range" = "storeys"))

##5.4 Selecting required columns for analysis

rs_factors_training <- rs_factors_training %>%
  select(resale_price, floor_area_sqm, storey_order, remaining_lease_mths,
         PROX_CBD, PROX_ELDERLYCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_GOOD_PRISCH, PROX_MALL, PROX_CHAS,
         PROX_SUPERMARKET, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS, WITHIN_1KM_PRISCH)

#6. Exploratory Data Analysis

##6.1 HDB 3 room resale prices in Histogram

ggplot(data=rs_factors_training, aes(x=`resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The results shown above reveals a right skewed distribution. This means that more resale HDB units were transacted at relative lower prices about $500,000 majority. Statistically, the skewed distribution can be normalised by using log transformation which we will be doing in the next section.

##6.2 Normalise using Log Transformation of HDB 3 room resale prices in Histogram

A new variable called LOG_RESALE_PRICE by using a log transformation on the variable resale_price

rs_factors_training <- rs_factors_training %>%
  mutate(`LOG_SELLING_PRICE` = log(resale_price))

Replot Histogram of LOG_RESALE_PRICE

ggplot(data=rs_factors_training, aes(x=`LOG_SELLING_PRICE`)) +
  geom_histogram(bins=20, color="black", fill="green")

Evidently, the distribution is now less skewed after transforming.

##6.3 Various histogram distribution of structural variables

Here we will consider some structural factors involved such as floor_area_sqm, storey_order, remaining_lease_mths

s_factor <- c("floor_area_sqm","remaining_lease_mths", "storey_order")

A list will be created to store the histogram structural variables

s_factor_hist_list <- vector(mode = "list", length = length(s_factor))
for (i in 1:length(s_factor)) {
  hist_plot <- ggplot(rs_factors_training, aes_string(x = s_factor[[i]])) +
    geom_histogram(color="firebrick", fill = "light coral") +
    labs(title = s_factor[[i]]) +
    theme(plot.title = element_text(size = 10),
          axis.title = element_blank())
  
  s_factor_hist_list[[i]] <- hist_plot
}

Plot a histogram based on the distribution of structural variables

ggarrange(plotlist = s_factor_hist_list,
          ncol = 2,
          nrow = 2)

From the histogram results, floor_area_sqm only seem to have a normal distribution as compared to other variables.

The histogram of storey_order is skewed towards the right which means that the resale HDBs in this period and flat_type are generally on the lower levels.

For the remaining_lease_mths, there are a few pattern found of popular transactions involving ~ 600 to 800 months and ~ 1100 months which demonstrates that more transactions are done on resale flats that have minimum 66 years lease remaining.

##6.4 Various histogram of location variables

Extraction of location factors column names to plot

l_factor <- c("PROX_CBD", "PROX_ELDERLYCARE", "PROX_HAWKER", "PROX_MRT", "PROX_PARK", "PROX_GOOD_PRISCH", "PROX_MALL", "PROX_CHAS",
              "PROX_SUPERMARKET", "WITHIN_350M_KINDERGARTEN", "WITHIN_350M_CHILDCARE", "WITHIN_350M_BUS", "WITHIN_1KM_PRISCH")

Similarly, a list will be created to store the histogram location variables

l_factor_hist_list <- vector(mode = "list", length = length(l_factor))
for (i in 1:length(l_factor)) {
  hist_plot <- ggplot(rs_factors_training, aes_string(x = l_factor[[i]])) +
    geom_histogram(color="midnight blue", fill = "light sky blue") +
    labs(title = l_factor[[i]]) +
    theme(plot.title = element_text(size = 10),
          axis.title = element_blank())
  
  l_factor_hist_list[[i]] <- hist_plot
}

Plotting histogram to visualise distribution of location variables

ggarrange(plotlist = l_factor_hist_list,
          ncol = 3,
          nrow = 5)

PROX_GOOD_PRISCH have some patterns involved whereby there more transactions happening within the proximity of approximately 3km. PROX_CBD resemble somewhat a normal distribution.

Other variables like PROX_SUPERMARKET, PROX_HAWKER, PROX_MRT, PROX_MALL, PROX_CHAS, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE is skewed towards the right. This shows that there are more resale transactions involved and residents prefer to stay near resale flats with these amenities.

##6.5 Statistical Point Map

tmap_mode("view")
tm_shape(rs_factors_training) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14)) +
tm_basemap("OpenStreetMap")
tmap_mode("plot")

From the interactive map, we can observe that 3 room resale HDBs around the North-East, South and Central is indicated by the darker orange points which means that the points tend to have higher resale prices. Also, there are lighter yellow points concentrated around the North and East area.

#7. Data Sampling of Training and Testing Data

The entire data are split into training and test data sets with 2021-01 to 2022-12 being the training set and 2023-01 to 2023-02 being the testing set. This can be done by using initial_split() of rsample package. Rsample is one of the package of tidymodels.

Due to time constraint and huge dataset, we will only use 6 months (2022-07 to 2022-12) of dataset to perform the training instead.

set.seed(1234)
rs_coords_sf <-  filter(rs_factors_rds_training,flat_type == "3 ROOM") %>% 
              filter(month >= "2022-07" & month <= "2023-02")
resale_split <- initial_split(rs_coords_sf[,8:26], 
                              prop = 7.5/10,)

train_data <- training(resale_split)
test_data <- testing(resale_split)

##7.1 Computing Correlation Matrix

rs_coords_sf_nogeo <- rs_coords_sf %>%
  st_drop_geometry()
corrplot::corrplot(cor(rs_coords_sf_nogeo[, 14:26]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

As a result, all the correlation values are below 0.8 with the exception of WITHIN_350M_CHILDCARE that has a value of 0.92 which means that WITHIN_350M_CHILDCARE has a sign of multicolinearity with WTIHIN_350M_KINDERGARTEN.

##7.2 Non-spatial multiple linear regression

The following code chunk is to build the non-spatial multiple linear regression.

price_mlr <- lm(resale_price ~ floor_area_sqm +
                  remaining_lease_mths +
                  PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER +
                  PROX_MRT + PROX_PARK + PROX_MALL + 
                  PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                  WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                  WITHIN_1KM_PRISCH,
                data=train_data)
summary(price_mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + remaining_lease_mths + 
    PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + 
    PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN + 
    WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH, 
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-215209  -30299   -3360   24682  451378 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -1.207e+05  1.104e+04 -10.935  < 2e-16 ***
floor_area_sqm            5.172e+03  1.458e+02  35.475  < 2e-16 ***
remaining_lease_mths      3.744e+02  5.077e+00  73.742  < 2e-16 ***
PROX_CBD                 -9.255e+03  2.511e+02 -36.864  < 2e-16 ***
PROX_ELDERLYCARE         -9.511e+03  1.745e+03  -5.451 5.38e-08 ***
PROX_HAWKER              -1.096e+04  2.316e+03  -4.732 2.32e-06 ***
PROX_MRT                 -1.562e+04  2.477e+03  -6.305 3.26e-10 ***
PROX_PARK                -1.168e+04  2.395e+03  -4.879 1.12e-06 ***
PROX_MALL                -1.217e+04  2.590e+03  -4.697 2.74e-06 ***
PROX_SUPERMARKET          2.729e+04  4.882e+03   5.589 2.46e-08 ***
WITHIN_350M_KINDERGARTEN  7.231e+03  1.108e+03   6.527 7.72e-11 ***
WITHIN_350M_CHILDCARE    -1.025e+04  1.357e+03  -7.554 5.40e-14 ***
WITHIN_350M_BUS           1.428e+03  3.396e+02   4.206 2.66e-05 ***
WITHIN_1KM_PRISCH        -1.960e+03  6.721e+02  -2.917  0.00356 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 48430 on 3330 degrees of freedom
Multiple R-squared:  0.6886,    Adjusted R-squared:  0.6874 
F-statistic: 566.4 on 13 and 3330 DF,  p-value: < 2.2e-16

The r-squared shows a value of 0.6869 and the adjusted r-squared shows a value of 0.6857.The difference between R squared and adjusted R squared value is that R squared value assumes that all the independent variables considered affect the result of the model, whereas the adjusted R squared value considers only those independent variables which actually have an effect on the performance of the model. So although the regression line is a good fit, there is lesser effect on the performance of the model than expected.

##7.3 Geographically Weighted Regression Predictive Method

Lets calibrate the model to predict HDB resale price by using geographically weighted regression method of GWmodel package.

Firstly, we convert the sf data.frame to SpatialPointDataFrame

train_data_sp <- as_Spatial(train_data)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 3344 
extent      : 11807.06, 45154.27, 28097.64, 48622.47  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 18
names       : floor_area_sqm, flat_model, lease_commence_date, remaining_lease_mths, resale_price, postal,     PROX_ELDERLYCARE,           PROX_MRT,        PROX_HAWKER,          PROX_PARK,     PROX_SUPERMARKET,            PROX_CHAS, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS, ... 
min values  :             51,       DBSS,                1967,                  516,       245000, 050004, 2.90010872692511e-07, 0.0391289507571673, 0.0069808810867684, 0.0834312052773543, 1.46885170959002e-09, 1.16317240314437e-09,                        0,                     0,               1, ... 
max values  :            155,    Terrace,                2019,                 1150,      1068000, 824601,     4.78082950913573,   3.51571661813739,   2.46908408139604,   2.37596211741372,     3.31803184483494,     2.71421661000055,                       12,                    12,              16, ... 

Computing adaptive bandwidth

Next, bw.gwr() of GWmodel package will be used to determine the optimal bandwidth to be used.

bw_adaptive <- bw.gwr(resale_price ~ floor_area_sqm +
                  remaining_lease_mths +
                  PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER +
                  PROX_MRT + PROX_PARK + PROX_MALL + 
                  PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                  WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                  WITHIN_1KM_PRISCH,
                  data=train_data_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth: 2074 CV score: 7.276392e+12 
Adaptive bandwidth: 1290 CV score: 6.829296e+12 
Adaptive bandwidth: 804 CV score: 6.285662e+12 
Adaptive bandwidth: 505 CV score: 5.61714e+12 
Adaptive bandwidth: 319 CV score: 4.827511e+12 
Adaptive bandwidth: 205 CV score: 4.479947e+12 
Adaptive bandwidth: 133 CV score: 4.299238e+12 
Adaptive bandwidth: 90 CV score: 4.224232e+12 
Adaptive bandwidth: 62 CV score: 4.120263e+12 
Adaptive bandwidth: 46 CV score: 4.029567e+12 
Adaptive bandwidth: 34 CV score: 3.897764e+12 
Adaptive bandwidth: 29 CV score: 3.757576e+12 
Adaptive bandwidth: 23 CV score: 3.828702e+12 
Adaptive bandwidth: 29 CV score: 3.757576e+12 

As shown above, 29 neighbour points will be the optimal bandwidth to be used.

Constructing the adaptive bandwidth GWR model

gwr_adaptive <- gwr.basic(formula = resale_price ~
                            floor_area_sqm +
                            remaining_lease_mths + PROX_CBD + 
                            PROX_ELDERLYCARE + PROX_HAWKER +
                            PROX_MRT + PROX_PARK + PROX_MALL + 
                            PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                            WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                            WITHIN_1KM_PRISCH,
                          data=train_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-25 05:17:18 
   Call:
   gwr.basic(formula = resale_price ~ floor_area_sqm + remaining_lease_mths + 
    PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + 
    PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN + 
    WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH, 
    data = train_data_sp, bw = bw_adaptive, kernel = "gaussian", 
    adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  floor_area_sqm remaining_lease_mths PROX_CBD PROX_ELDERLYCARE PROX_HAWKER PROX_MRT PROX_PARK PROX_MALL PROX_SUPERMARKET WITHIN_350M_KINDERGARTEN WITHIN_350M_CHILDCARE WITHIN_350M_BUS WITHIN_1KM_PRISCH
   Number of data points: 3344
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-215209  -30299   -3360   24682  451378 

   Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
   (Intercept)              -1.207e+05  1.104e+04 -10.935  < 2e-16 ***
   floor_area_sqm            5.172e+03  1.458e+02  35.475  < 2e-16 ***
   remaining_lease_mths      3.744e+02  5.077e+00  73.742  < 2e-16 ***
   PROX_CBD                 -9.255e+03  2.511e+02 -36.864  < 2e-16 ***
   PROX_ELDERLYCARE         -9.511e+03  1.745e+03  -5.451 5.38e-08 ***
   PROX_HAWKER              -1.096e+04  2.316e+03  -4.732 2.32e-06 ***
   PROX_MRT                 -1.562e+04  2.477e+03  -6.305 3.26e-10 ***
   PROX_PARK                -1.168e+04  2.395e+03  -4.879 1.12e-06 ***
   PROX_MALL                -1.217e+04  2.590e+03  -4.697 2.74e-06 ***
   PROX_SUPERMARKET          2.729e+04  4.882e+03   5.589 2.46e-08 ***
   WITHIN_350M_KINDERGARTEN  7.231e+03  1.108e+03   6.527 7.72e-11 ***
   WITHIN_350M_CHILDCARE    -1.025e+04  1.357e+03  -7.554 5.40e-14 ***
   WITHIN_350M_BUS           1.428e+03  3.396e+02   4.206 2.66e-05 ***
   WITHIN_1KM_PRISCH        -1.960e+03  6.721e+02  -2.917  0.00356 ** 

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 48430 on 3330 degrees of freedom
   Multiple R-squared: 0.6886
   Adjusted R-squared: 0.6874 
   F-statistic: 566.4 on 13 and 3330 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 7.809959e+12
   Sigma(hat): 48341.61
   AIC:  81654.95
   AICc:  81655.09
   BIC:  78524.4
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 29 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                   Min.     1st Qu.      Median     3rd Qu.
   Intercept                -4980253.13  -232986.04   -32049.31   169467.34
   floor_area_sqm              -9574.79     2464.43     3438.08     4478.22
   remaining_lease_mths        -3648.29      215.90      350.52      477.35
   PROX_CBD                  -629169.82   -19192.90    -5984.63     7171.31
   PROX_ELDERLYCARE          -383505.02   -12968.76     7232.76    29984.77
   PROX_HAWKER               -450126.82   -27552.01    -2462.73    30991.70
   PROX_MRT                  -710511.71   -62468.08   -31654.05    -5846.06
   PROX_PARK                 -251974.70   -36228.92    -2538.87    26072.89
   PROX_MALL                 -603433.33   -29692.24     1450.35    34098.72
   PROX_SUPERMARKET         -1184867.87   -38761.80    -4874.35    29857.60
   WITHIN_350M_KINDERGARTEN   -59965.43    -4462.57      568.81     5428.25
   WITHIN_350M_CHILDCARE     -108055.65    -6893.16     -985.86     4095.16
   WITHIN_350M_BUS            -14758.58    -1472.04      413.62     2462.64
   WITHIN_1KM_PRISCH         -324466.99    -6455.76    -1683.59     3402.55
                                  Max.
   Intercept                4039875.35
   floor_area_sqm             15900.42
   remaining_lease_mths         732.32
   PROX_CBD                  633909.91
   PROX_ELDERLYCARE          790166.34
   PROX_HAWKER              1252984.53
   PROX_MRT                  400584.35
   PROX_PARK                 536692.45
   PROX_MALL                 787334.47
   PROX_SUPERMARKET          811138.22
   WITHIN_350M_KINDERGARTEN  102394.35
   WITHIN_350M_CHILDCARE     183578.37
   WITHIN_350M_BUS            24473.74
   WITHIN_1KM_PRISCH         454330.66
   ************************Diagnostic information*************************
   Number of data points: 3344 
   Effective number of parameters (2trace(S) - trace(S'S)): 744.4552 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 2599.545 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 78536.97 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 77642.79 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 78668.9 
   Residual sum of squares: 1.975628e+12 
   R-square value:  0.9212289 
   Adjusted R-square value:  0.8986619 

   ***********************************************************************
   Program stops at: 2023-03-25 05:17:25 

For this case, the R-square value: 0.9211137 and Adjusted R-square value: 0.8984393 which shows that there is a very good fit of regression line to the data if adaptive bandwidth is used for this data set.

##7.4 Preparing Coordinates Data

To extract the x,y coordinates of the full, training and test data sets.

coords <- st_coordinates(rs_coords_sf)
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)

Dropping Geometry Field

We will drop geometry column of the sf data.frame by using st_drop_geometry() of sf package.

train_data <- train_data %>% 
  st_drop_geometry()

##7.5 Calibrating Random Forest Model

In this section, calibration of a model to predict HDB resale price will be done by using random forest function of ranger package.

set.seed(1234)
rf <- ranger(resale_price ~ floor_area_sqm + 
               remaining_lease_mths + PROX_CBD + PROX_ELDERLYCARE + 
               PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_MALL + 
               PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
               WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + 
               WITHIN_1KM_PRISCH,
             data=train_data)
print(rf)
Ranger result

Call:
 ranger(resale_price ~ floor_area_sqm + remaining_lease_mths +      PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER + PROX_MRT + PROX_PARK +      PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +      WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH,      data = train_data) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3344 
Number of independent variables:  13 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       908069907 
R squared (OOB):                  0.8789633 

In the result, the r-squared has shown the figure of 0.8793007. The closer the r-squared value is to 1, the better the fit. So since the figure is relatively close to 1, this means that the regression line fits the data quite well.

##7.6 Calibrating Geographical Random Forest Model

In this section, calibration of a model to predict HDB resale price will be done by using grf() of SpatialML package.

###7.6.1 Calibrating using training data.

The code chunk below calibrate a geographic random forest model by using grf() of SpatialML package.

set.seed(1234)
gwRF_adaptive <- grf(formula = resale_price ~ floor_area_sqm +
                       remaining_lease_mths + PROX_CBD + PROX_ELDERLYCARE +
                       PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_MALL +
                       PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +
                       WITHIN_350M_CHILDCARE + WITHIN_350M_BUS +
                       WITHIN_1KM_PRISCH,
                     dframe=train_data, 
                     bw=29,
                     kernel="adaptive",
                     coords=coords_train)
Ranger result

Call:
 ranger(resale_price ~ floor_area_sqm + remaining_lease_mths +      PROX_CBD + PROX_ELDERLYCARE + PROX_HAWKER + PROX_MRT + PROX_PARK +      PROX_MALL + PROX_SUPERMARKET + WITHIN_350M_KINDERGARTEN +      WITHIN_350M_CHILDCARE + WITHIN_350M_BUS + WITHIN_1KM_PRISCH,      data = train_data, num.trees = 500, mtry = 4, importance = "impurity",      num.threads = NULL) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3344 
Number of independent variables:  13 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       884540419 
R squared (OOB):                  0.8820995 
          floor_area_sqm     remaining_lease_mths                 PROX_CBD 
            3.475222e+12             9.707327e+12             4.463145e+12 
        PROX_ELDERLYCARE              PROX_HAWKER                 PROX_MRT 
            7.098509e+11             1.141180e+12             1.269908e+12 
               PROX_PARK                PROX_MALL         PROX_SUPERMARKET 
            7.939024e+11             8.066853e+11             8.443973e+11 
WITHIN_350M_KINDERGARTEN    WITHIN_350M_CHILDCARE          WITHIN_350M_BUS 
            2.359512e+11             3.502023e+11             2.990132e+11 
       WITHIN_1KM_PRISCH 
            4.452732e+11 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-328439.1  -12212.1      43.9     802.5   12861.1  578714.3 
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-105158.42   -5394.42       4.79     -38.44    5277.39  114090.99 
                               Min          Max        Mean         StD
floor_area_sqm                   0 820484857944 10422165727 38207216483
remaining_lease_mths     188662784 319055039575 17866830832 38145543472
PROX_CBD                 101937448 142276839382  5761662490 13869073188
PROX_ELDERLYCARE         104095347 667016931411  7725731142 26066267291
PROX_HAWKER              127332433 139666769224  5767490124 12687436007
PROX_MRT                 116170844 481039605970  7256303205 25981313263
PROX_PARK                 64785454 618268503018  8041581816 33540521589
PROX_MALL                 71567569 352809966932  6794351930 21399168175
PROX_SUPERMARKET          74027262 637398387744  7483835255 33844732102
WITHIN_350M_KINDERGARTEN   9237484 188055279753  2891154690  8095425136
WITHIN_350M_CHILDCARE      7782921  65577242545  2125037689  5464835513
WITHIN_350M_BUS           13267811 155314918270  4359043897 13518665397
WITHIN_1KM_PRISCH                0  90173678644  1669554060  6358926806

In this case, the r-squared has shown the figure of 0.8829828. The closer the r-squared value is to 1, the better the fit. So since the figure is relatively close to 1, this means that the regression line fits the data quite well.

write_rds(gwRF_adaptive, "data/rds/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/rds/gwRF_adaptive.rds")

###7.6.2 Predicting using Test Data

The code chunk below will be used to combine the test data with its corresponding coordinates data.

Preparing the test data

test_data <- cbind(test_data, coords_test) %>%
  st_drop_geometry()

Predicting with test data

Next, predict.grf() of spatialML package will be used to predict the resale value by using the test data and gwRF_adaptive model calibrated earlier.

gwRF_pred <- predict.grf(gwRF_adaptive, 
                           test_data, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)

Before moving on, let us save the output into rds file for future use.

GRF_pred <- write_rds(gwRF_pred, "data/rds/GRF_pred.rds")
GRF_pred
   [1] 360854.8 341081.0 426346.0 351358.3 360183.4 342303.0 376290.0 346821.4
   [9] 372836.6 347931.4 394486.5 331749.2 361610.5 359481.7 350623.9 355095.5
  [17] 331357.1 382630.7 558306.8 374674.6 360082.6 355718.3 399743.3 357867.1
  [25] 409777.9 409777.9 339629.6 415621.2 424191.5 359733.5 348572.8 356948.7
  [33] 376778.8 347296.8 359132.7 712554.9 351028.6 682193.4 288256.5 518638.2
  [41] 407044.7 353523.3 378653.5 453368.5 354426.5 369694.5 384389.6 608416.7
  [49] 377682.1 428617.3 571664.1 305831.9 549891.8 391439.7 342452.7 384091.7
  [57] 359269.6 468631.6 495787.0 379824.4 298181.1 278505.5 281724.8 369605.7
  [65] 341459.2 424189.7 386332.1 334724.1 344938.8 279686.1 688642.9 338129.1
  [73] 511172.1 422589.7 422589.7 674512.5 532493.7 380143.6 453580.5 438412.5
  [81] 435818.3 441749.8 454666.9 403635.2 435914.4 576370.7 354866.6 318683.2
  [89] 411249.1 447697.1 444400.3 427232.3 444188.3 385928.6 373959.7 396676.7
  [97] 390230.7 516517.4 517755.8 395835.9 394812.0 394435.3 406574.9 381473.3
 [105] 414252.9 367982.6 329457.2 320299.7 329457.2 697795.3 287126.6 321269.1
 [113] 336278.3 326926.2 336330.8 293352.0 391670.0 317423.7 312923.3 411743.9
 [121] 414046.5 405408.7 413029.9 363319.0 391490.7 380698.5 366211.8 350641.9
 [129] 379950.5 361405.2 398363.4 589319.2 343607.9 419407.5 370396.0 344615.4
 [137] 389313.5 385226.4 393152.0 369112.6 382715.3 355648.8 365800.1 466655.8
 [145] 387183.9 379817.6 363041.1 372773.9 362238.8 337970.2 347491.6 384105.3
 [153] 360860.4 361522.1 346279.9 333574.2 340889.6 364571.8 370057.5 502004.1
 [161] 500388.2 408828.4 616659.9 433593.0 404024.1 629606.3 378782.3 422539.7
 [169] 406687.8 375327.2 408767.5 364213.0 456198.0 382009.5 361885.0 384728.8
 [177] 573393.1 335196.1 358774.8 394120.0 282167.2 303173.7 280151.5 331110.2
 [185] 399811.0 367296.7 336688.5 326130.9 375101.3 365059.9 364612.5 351601.0
 [193] 363190.4 441703.4 333004.6 357218.2 379518.9 387983.0 300709.2 411933.1
 [201] 370959.4 422325.5 319770.7 688642.9 674736.0 418631.9 511172.1 302438.8
 [209] 925869.3 527664.5 563728.0 331943.6 334569.7 494106.0 405495.8 395426.2
 [217] 423225.7 438725.1 432566.0 439554.4 435818.3 437956.0 448068.2 473912.2
 [225] 447937.7 551419.8 359050.9 398360.6 377149.4 535868.7 402399.2 434508.3
 [233] 414721.4 446889.7 446889.7 456209.7 357350.8 445962.6 393114.1 420185.2
 [241] 400562.0 391198.1 386321.1 376245.3 427265.0 685604.0 314371.2 336251.7
 [249] 324452.9 372819.5 295237.3 322471.4 308272.2 386892.5 331670.9 454700.2
 [257] 454700.2 443042.3 454700.2 365502.9 379922.2 358920.4 416491.3 349885.5
 [265] 381299.2 458283.0 347427.4 348119.8 427724.6 416358.3 543986.9 356755.2
 [273] 356445.2 410747.1 397872.6 410747.1 368451.1 390672.0 376438.9 371980.3
 [281] 355179.8 394242.2 356717.1 416708.9 331749.2 296533.2 367777.3 300599.0
 [289] 385737.8 356158.0 319777.6 350581.4 355186.8 349801.9 381612.0 353435.3
 [297] 409521.6 350054.4 359150.3 401094.7 469499.0 438067.2 331442.9 342923.1
 [305] 355936.6 353987.4 388957.5 364134.7 496205.4 356868.5 381450.3 369833.2
 [313] 502095.1 500388.2 521447.8 546136.4 583743.8 734527.7 309014.0 383956.5
 [321] 742438.5 365824.5 381213.9 346664.2 460452.4 342354.6 313480.1 514345.5
 [329] 475290.0 455530.9 541500.0 396894.7 392351.5 291889.8 289647.6 281174.2
 [337] 437480.6 449258.2 286266.3 329016.3 301853.5 322062.7 309510.3 388038.6
 [345] 367296.7 340920.4 317267.6 376035.0 341246.3 424763.3 372057.5 388152.7
 [353] 479031.6 493028.5 350539.6 357437.0 306084.9 382118.3 423182.4 410664.1
 [361] 410664.1 688642.9 386772.9 369603.1 925855.7 527604.5 560442.1 395188.8
 [369] 492649.0 464928.2 464928.2 473890.1 431931.8 473912.2 435369.5 366148.6
 [377] 319310.7 627602.7 675941.4 415419.6 372967.5 368651.5 318983.1 326660.8
 [385] 347060.2 332093.3 632808.8 402395.3 456462.0 440793.2 434196.9 454412.2
 [393] 393460.9 400928.9 426847.2 363140.3 397912.3 391366.7 396843.9 402549.4
 [401] 409439.0 363111.0 359866.9 413510.9 405811.7 289540.8 317995.2 698699.7
 [409] 341851.8 705785.4 683714.6 341530.9 317995.2 727183.7 339213.0 310773.2
 [417] 361264.8 478318.5 295357.3 417332.4 414905.8 414456.7 380589.3 358568.4
 [425] 349200.6 374435.5 345916.7 359927.8 343222.8 369891.0 430338.0 325059.1
 [433] 375949.4 376058.6 352586.6 373039.5 358251.4 337799.3 348691.3 369393.5
 [441] 388049.9 354045.8 380785.4 378399.8 354866.1 361440.2 374255.1 365019.3
 [449] 360038.6 406042.5 344892.0 328704.3 361331.4 445648.9 368156.7 354027.1
 [457] 353029.1 349256.1 444591.2 777637.0 309014.0 600236.6 391671.3 344788.8
 [465] 654273.3 380175.3 414629.7 422029.8 422159.9 418701.2 416332.4 393789.2
 [473] 456198.0 404152.7 532774.0 355565.6 364777.9 378652.5 360385.2 288571.6
 [481] 281174.2 430527.1 314313.1 551824.8 322229.3 333275.6 363867.5 365478.6
 [489] 441695.4 371802.7 368875.1 421778.1 315884.6 299299.1 360008.7 353463.4
 [497] 394831.5 276426.9 365817.9 338633.2 860033.4 681448.7 681448.7 333924.8
 [505] 417758.1 402244.4 454610.1 452288.7 435988.6 437475.8 444263.1 293560.0
 [513] 617830.7 667490.1 667490.1 395868.2 412496.6 447322.7 462520.2 368203.0
 [521] 615078.8 437202.7 461609.0 435450.3 447483.6 454410.2 464723.9 404817.6
 [529] 436676.5 422139.2 444948.3 382816.2 373140.1 341258.2 342793.3 486806.5
 [537] 401628.1 434071.5 428174.6 383188.1 398269.6 290067.5 343236.6 341534.8
 [545] 343280.8 370837.3 363988.4 277205.7 319281.7 337514.9 333808.6 333336.7
 [553] 453124.6 414879.9 414504.7 358634.6 381819.5 359044.3 347888.5 347586.0
 [561] 374912.0 389063.5 400150.7 382806.3 376052.6 380896.0 392515.2 377699.2
 [569] 396382.7 394082.5 392773.0 345816.7 366514.6 652620.2 378385.2 329243.7
 [577] 354627.0 561452.1 360038.6 357867.1 529687.0 338788.1 416178.8 316841.6
 [585] 362476.6 496205.4 346440.5 411463.8 300171.1 323405.7 384306.8 353096.5
 [593] 433257.2 307206.2 380325.1 378299.9 356752.6 366051.5 395269.1 460369.4
 [601] 441611.2 456703.8 377914.4 439007.4 352175.1 430527.1 300911.1 288630.3
 [609] 548657.3 320491.6 377712.0 347096.5 322229.3 385659.3 386727.1 314601.6
 [617] 358597.0 358873.9 373401.4 515918.7 475174.5 361066.1 371802.7 366477.6
 [625] 330114.1 373703.6 274574.4 340457.9 351982.9 466248.0 369603.1 338633.2
 [633] 284394.3 331857.5 555458.6 342212.3 404701.2 347150.5 333924.8 333924.8
 [641] 451374.8 453509.1 435823.0 415488.9 453110.3 453687.6 453687.6 431880.4
 [649] 483608.7 412556.3 644326.5 660794.1 406462.9 350738.2 437824.3 441518.8
 [657] 448685.5 432712.0 460468.9 399016.9 518722.9 390872.1 414519.3 402434.2
 [665] 415034.3 379932.4 384913.1 417422.1 433053.8 430143.5 398324.7 413557.3
 [673] 365574.9 337402.5 347325.2 685527.9 326679.6 320921.2 323003.7 307665.7
 [681] 297620.0 318918.0 333190.9 418631.2 377709.6 387189.8 385704.3 353550.2
 [689] 346859.7 370562.8 363028.3 360571.2 430986.4 417860.9 587681.7 380770.7
 [697] 380332.7 350835.0 356581.0 397185.7 347749.9 378464.5 367344.4 370859.4
 [705] 390384.2 385458.4 361175.3 372827.2 361700.2 515058.1 395542.3 459773.7
 [713] 346016.1 356665.3 348455.3 376099.3 340296.8 500518.2 690016.2 428591.4
 [721] 335004.4 373934.5 349237.4 374526.5 312948.3 372517.3 412109.5 380977.3
 [729] 351403.1 351403.1 397406.4 290687.2 320086.7 283550.8 286748.4 302241.7
 [737] 293140.1 426329.9 357670.2 333275.6 333072.8 380132.9 403054.0 395266.0
 [745] 320572.7 363024.6 365682.8 359444.8 359444.8 451733.4 340457.9 415344.2
 [753] 430755.9 372080.9 348047.3 501547.5 459572.1 411808.3 472654.5 456065.2
 [761] 480263.5 482170.2 412556.3 444529.9 301275.3 322504.8 375733.8 672015.4
 [769] 672015.4 354437.7 462520.2 403095.2 354709.0 447520.1 427017.7 431161.8
 [777] 453565.7 430418.7 430418.7 440464.4 451346.9 453384.5 454444.4 429729.7
 [785] 437157.8 378126.5 421079.7 518722.9 392516.8 399285.5 403134.6 390426.7
 [793] 389341.1 415175.5 415701.4 332486.6 344830.7 678424.0 335275.8 435628.2
 [801] 289249.1 340598.4 319395.7 305767.6 296309.2 307330.4 430522.1 439283.7
 [809] 405408.7 407987.5 399332.1 419011.3 417424.3 438165.1 435498.8 427163.9
 [817] 447945.6 339715.9 409938.2 420693.6 354524.8 339088.0 419329.9 417413.6
 [825] 375027.2 384330.2 358832.3 370307.2 321848.2 347558.7 380171.9 420132.0
 [833] 348301.4 391761.9 375996.1 331225.3 372556.1 380305.4 349639.4 326025.2
 [841] 367871.3 317674.4 333580.5 395631.4 351894.0 377165.3 424008.7 576717.8
 [849] 362087.1 366968.0 381891.4 405674.4 340819.9 355010.2 356264.6 353144.3
 [857] 514711.9 364686.9 521317.8 704716.8 309014.0 539767.3 700128.6 682598.4
 [865] 383314.6 360171.8 699870.7 371081.3 398256.7 412963.6 395751.5 473826.1
 [873] 421529.8 420797.7 406952.2 367677.7 367456.6 458630.0 411158.0 383590.1
 [881] 372171.2 405595.1 396686.8 591852.0 367962.4 368837.6 303064.0 306707.6
 [889] 297571.2 580107.9 302291.1 341275.6 549891.8 388452.0 404249.9 408553.8
 [897] 387595.4 373328.5 338386.1 376433.1 341358.1 385624.2 377950.8 362524.2
 [905] 365091.3 489507.4 442956.1 512760.9 363750.4 368431.7 414212.7 413442.1
 [913] 329398.6 399580.8 317261.5 282832.1 358343.4 454326.7 370799.0 397632.7
 [921] 372039.4 375648.7 373627.3 373627.3 365951.6 369410.1 369410.1 421699.8
 [929] 271764.5 422611.3 300031.9 375357.8 375357.8 457812.8 435907.3 465878.0
 [937] 442608.4 412838.5 492534.2 445876.9 440134.4 469119.9 323128.3 281100.1
 [945] 373015.4 389494.7 462520.2 415462.4 388889.6 419005.2 419005.2 447777.1
 [953] 437302.8 459335.0 451840.0 367731.3 402815.0 399092.1 336310.1 343006.2
 [961] 443320.9 405303.7 467749.0 390925.8 410137.0 394402.8 428198.1 410687.5
 [969] 393809.8 386958.9 475121.1 456029.7 398894.1 725154.7 339719.0 341969.5
 [977] 341969.5 289249.1 396591.2 327938.3 313275.7 300247.0 312116.7 434871.2
 [985] 430522.1 457244.4 457244.4 371724.8 417577.5 431065.3 355810.6 349483.6
 [993] 370611.9 439966.4 447702.9 373953.7 359725.8 360409.2 396772.3 625624.2
[1001] 438541.6 435495.1 392259.5 350775.4 380278.9 652330.2 604801.4 332167.2
[1009] 342456.4 415634.1 332670.5 379668.3 375424.7 580862.5 360705.5 362330.0
[1017] 357188.0 370693.4 409521.6 362161.5 457310.3 369844.7 680244.4 690016.2
[1025] 449226.3 308984.2 358189.7 348409.6 698511.8 365229.4 381781.0 411627.1
[1033] 465885.2 322528.9 388639.7 356656.3 365306.3 652310.6 369376.6 364354.4
[1041] 379358.1 596953.7 282560.4 532813.4 549891.8 373328.5 374090.2 333004.7
[1049] 351863.4 385688.7 450966.7 366397.5 364453.6 357588.7 313535.5 336016.2
[1057] 278077.6 343606.4 344976.7 359344.2 398019.5 459950.9 712415.6 344820.2
[1065] 300031.9 560462.1 332629.3 498796.2 428350.3 417920.0 437894.8 440182.8
[1073] 419303.7 435795.9 430546.2 416430.6 673527.3 453566.8 422482.3 431033.7
[1081] 427062.4 443320.9 411221.1 410137.0 399285.5 393492.3 414018.3 378461.9
[1089] 386196.8 353163.6 375435.8 325580.1 332585.3 335923.4 461795.2 319380.5
[1097] 350392.6 316742.7 324059.0 312116.7 391750.0 457244.4 435498.8 393114.8
[1105] 425862.6 372991.8 373474.3 355955.1 353415.4 405746.8 482596.9 380698.5
[1113] 375507.0 403400.7 375803.4

Converting the predicting output into a data frame

The output of the predict.grf() is a vector of predicted values. It is wiser to convert it into a data frame for further visualisation and analysis.

GRF_pred <- read_rds("data/rds/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)

In the code chunk below, cbind() is used to append the predicted values onto test_data

test_data_p <- cbind(test_data, GRF_pred_df)
write_rds(test_data_p, "data/rds/test_data_p.rds")

###7.6.3 Calculating Root Mean Square Error

The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis. In the code chunk below, rmse() of Metrics package is used to compute the RMSE.

rmse(test_data_p$resale_price, 
     test_data_p$GRF_pred)
[1] 31988.66

This means that the average difference between values predicted by a model and the actual values is high with a value of 32313.3 which indicates that data may be over fitted and have lesser predictive value than expected.

###7.6.4 Visualising the predicted values

Alternatively, scatterplot can also be used to visualise the actual resale price and the predicted resale price by using the code chunk below.

ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = resale_price)) +
  geom_point()

As observed, we can see that there is a positive correlation between resale_price and GRF_pred.